Abstract

We study source coding in the presence of side information, when the system can take actions that affect the availability, 
quality, or nature of the side information. We begin by extending the Wyner-Ziv problem of source coding with decoder 
side information to the case where the decoder is allowed to choose actions affecting the side information. We then consider 
the setting where actions are taken by the encoder, based on its observation of the source. Actions may have costs that are 
commensurate with the quality of the side information they yield, and an overall per-symbol cost constraint may be imposed. 
We characterize the achievable tradeoffs between rate, distortion, and cost in some of these problem settings. Among our 
findings is the fact that even in the absence of a cost constraint, greedily choosing the action associated with the 'best' side 
information is, in general, sub-optimal. A few examples are worked out. 
I. Introduction



The role and potential benefit of Side Information (S.I.) in lossless and lossy data compression is a central theme in
information theory. In ways that are well understood for various source coding systems, S.I. can be a valuable resource,
resulting in significant performance boosts relative to the case where it is absent. In the problems studied thus far, the lack
or availability of the S.I., and its quality, are a given. But what if the system can take actions that affect the availability,
quality, or nature of the S.I.?

For example, consider a source coding system where the S.I. is a sequence of noisy measurements of the source sequence
to be compressed, each S.I. symbol acquired via a sensor. The quality of each S.I. symbol may be commensurate with
resources, such as power or time expended by the sensor for obtaining it, which are limited. Alternatively, or in addition, a
sensor may have freedom to choose, for each source symbol, how many independent noisy measurements to observe, with
a constraint on the overall number of measurements. It is then natural to wonder how these resources, which may or may
not be limited, should best be used, and what the corresponding optimum performance would be.

We abstract this problem by assuming a memoryless source $P_X$, a conditional distribution of the side information given
the source and an action $P_{Y|X,A}$, a function assigning costs to the possible actions, and a distortion measure. The first
scenario we focus on is that depicted in Figure 1, where the actions are taken at the decoder: Based on its observation of the
source sequence $X^n$, which is i.i.d. $\sim P_X$, the encoder gives an index to the decoder. Having received the index, the decoder
chooses the action sequence $A^n$. Nature then generates the side information sequence $Y^n$ as the output of the memoryless
channel $P_{Y|X,A}$ whose input is the pair $(X^n, A^n)$. The reconstruction sequence $\hat{X}^n$ is then based on the index and on the
side information sequence.

Fig. 1. Rate distortion with a side information vender at the decoder. The source $X^n$ is i.i.d. $\sim P_X$, and $Y^n$ is the output of the side information channel
$P_{Y|X,A}$ in response to the pair of sequences $X^n$ and $A^n$, where $A^n$ is the action sequence chosen by the decoder.

The setting of Figure 1 can be considered the source coding dual of coding for channels with action-dependent states,
where the transmitter chooses an action sequence that affects the formation of the channel states, and then creates the
channel input sequence based on the state sequence, as considered in [11]. We characterize the achievable tradeoff between
rate, distortion, and cost in Section II. We demonstrate, by a few examples, that greedily choosing the action associated
with the 'best' side information may be sub-optimal even in the absence of a cost constraint. Further, in the presence of
a cost constraint, time-sharing between schemes that are optimal for different cost values is, in general, sub-optimal. We
also characterize the fundamental limits for the case where the reconstruction is confined to causal dependence on the side
information sequence, and the case where the encoder observes a noisy version of the source rather than the source
itself.

The second scenario we consider is that depicted in Figure 2, where actions are taken at the encoder: Based on its
observation of the source sequence $X^n$, the encoder chooses a sequence of actions $A^n$. Nature then generates the side
information sequence $Y^n$ as the output of the memoryless channel $P_{Y|X,A}$ whose input is the pair $(X^n, A^n)$. The encoder
now chooses the index to be given to the decoder on the basis of the source and possibly the side information sequence
(according to whether or not the switch is closed). The reconstruction sequence $\hat{X}^n$ is then based on the index and on the side
information sequence. Though we leave the general case open, in Section III we characterize the achievable tradeoff between
rate, distortion, and cost for three important special cases: the (near) lossless case, the Gaussian case (where $Y = A + X + N$,
with $X$ and $N$ being independent Gaussian random variables), and the case of the Markov relation $Y - A - X$ (i.e., when
$P_{Y|X,A}$ is of the form $P_{Y|A}$). We end that section with Subsection III-D, giving lower and upper bounds on the achievable
rates for the general case. We summarize the paper and related open directions in Section IV.

The family of problems we consider in this work includes scenarios arising naturally in the coding or compression of
sources for which the S.I. arises from noisy measurements of the source components. The acquisition, handling, processing
and storage of these measurements may require system resources that come at a cost. This premise, that the acquisition of
source measurements may be costly and is to be done sparingly, is in fact central in the emerging Compressed Sensing
paradigm [1], [2], [5], arising naturally in the study of an increasing array of sensing problems. In many such problems,
the system has the freedom to choose how many sensors to deploy in each region of the phenomenon it is trying to
gauge, subject to an overall budget of sensors. Assuming each sensor provides an independent measurement of the source
region in which it was deployed, this setting corresponds to our model, with $A_i \in \{0, 1, 2, \ldots\}$ representing the number of
sensors, $P_{Y|X,A} = \prod_{i=1}^{A} P_{Z|X}$ representing $A$ independent measurements from the 'sensor channel' $P_{Z|X}$, and $\Lambda(A_i) = A_i$
assuming all sensors are equally costly. The cost constraint $C$ then corresponds to the budget of sensors to deploy, in number
of sensors per source region. We are not aware of previous work on source coding for systems allowed to take S.I.-affecting
actions from a Shannon theoretic perspective. We refer to [7] and some references therein for other recent Shannon theoretic
studies of new problems involving source coding in the presence of S.I.

Fig. 2. Rate distortion with side information vender at the encoder, where the side information is known at the decoder and may or may not be known
to the encoder. The source $X^n$ is i.i.d. $\sim P_X$ and side information is generated as the output of the memoryless channel $P_{Y|X,A}$ in response to the input
$(X^n, A^n)$, where the action sequence $A^n$ is chosen by the encoder.

II. Side Information Vending Machine at the Decoder 

Throughout the paper we let upper case, lower case, and calligraphic letters denote, respectively, random variables,
specific or deterministic values they may assume, and their alphabets. For two jointly distributed random objects $X$ and
$Y$, let $P_X$, $P_{X,Y}$, and $P_{X|Y}$ respectively denote the distribution of $X$, the joint distribution of $X, Y$, and the conditional
distribution of $X$ given $Y$. In particular, when $X$ and $Y$ are discrete, $P_{X|Y}$ represents the stochastic matrix whose elements
are $P_{X|Y}(x|y) = P(X = x | Y = y)$. The term $X_m^n$ denotes the $(n - m + 1)$-tuple $(X_m, \ldots, X_n)$ when $m \le n$ and the empty
set otherwise. The term $X^n$ is shorthand for $X_1^n$, and $X^{n \setminus i}$ stands for the $(n - 1)$-tuple consisting of all the components of
$X^n$ but $X_i$.

A. The Setup 

A source with action-dependent decoder side information is characterized by the source distribution $P_X$ and by the
conditional distribution of the side information given the source and an action, $P_{Y|X,A}$. The difference between this and
previously studied scenarios is that here, after receiving the index from the encoder, the decoder may choose actions that
will affect the nature of the side information it will get to observe. Specifically, a scheme in this setting for blocklength $n$ and
rate $R$ is characterized by an encoding function $T : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\}$, an action strategy $f : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{A}^n$,
and a decoding function $g : \{1, 2, \ldots, 2^{nR}\} \times \mathcal{Y}^n \to \hat{\mathcal{X}}^n$ that operate as follows:

• The source $n$-tuple $X^n$ is i.i.d. $\sim P_X$

• Encoding: based on $X^n$ give index $T = T(X^n)$ to the decoder

• Decoding:

- given the index, choose an action sequence $A^n = f(T)$

- the side information $Y^n$ will be the output of the memoryless channel $P_{Y|X,A}$ whose input is $(X^n, A^n)$

- let $\hat{X}^n = g(T, Y^n)$

A triple $(R, D, C)$ is said to be achievable if for all $\epsilon > 0$ and sufficiently large $n$ there exists a scheme as above for
blocklength $n$ and rate $R + \epsilon$ satisfying both

$$E\left[\sum_{i=1}^{n} \rho(X_i, \hat{X}_i)\right] \le n(D + \epsilon) \tag{1}$$

and

$$E\left[\sum_{i=1}^{n} \Lambda(A_i)\right] \le n(C + \epsilon), \tag{2}$$

where $\rho$ and $\Lambda$ are, respectively, given distortion and cost functions. The rate distortion (and cost) function $R(D, C)$ is
defined as

$$R(D, C) = \inf\{R' : \text{the triple } (R', D, C) \text{ is achievable}\}. \tag{3}$$



B. The Rate Distortion Cost Tradeoff 
Define

$$R^{(I)}(D, C) = \min \left[ I(X; A) + I(X; U | Y, A) \right], \tag{4}$$

where the joint distribution of $X, A, Y, U$ in (4) is of the form

$$P_{X,A,U,Y}(x, a, u, y) = P_X(x) P_{A,U|X}(a, u | x) P_{Y|X,A}(y | x, a), \tag{5}$$

and the minimization is over all $P_{A,U|X}$ under which

$$E\left[\rho\left(X, \hat{X}^{opt}(U, Y)\right)\right] \le D, \qquad E[\Lambda(A)] \le C, \tag{6}$$

where $\hat{X}^{opt}(U, Y)$ denotes the best estimate of $X$ based on $U, Y$, and $U$ is an auxiliary random variable. We show below that
the cardinality of $\mathcal{U}$ may be restricted to $|\mathcal{U}| \le |\mathcal{X}||\mathcal{A}| + 2$. Our main result pertaining to $R^{(I)}(D, C)$ is the following:

Theorem 1: The rate distortion cost function, as defined in (3), is given by $R^{(I)}(D, C)$ in (4), i.e.,

$$R(D, C) = R^{(I)}(D, C). \tag{7}$$
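The objective in (4) can be evaluated numerically for any candidate $P_{A,U|X}$. The following is a minimal sketch, not part of the original development: the helper names `joint_pmf`, `entropy`, and `objective` are hypothetical, pmfs are represented as plain dictionaries, and the illustrative instance is the binary Z-channel/S-channel example of Section II-D with $U = X$, $\delta = 1/2$, and a BSC of crossover probability $0.4$ from $X$ to $A$.

```python
from math import log2

def joint_pmf(p_x, p_au_given_x, p_y_given_xa):
    """Joint pmf of (x, a, u, y) with the factorization in (5), as a dict."""
    pmf = {}
    for x, px in p_x.items():
        for (a, u), pau in p_au_given_x[x].items():
            for y, py in p_y_given_xa[(x, a)].items():
                if px * pau * py > 0:
                    key = (x, a, u, y)
                    pmf[key] = pmf.get(key, 0.0) + px * pau * py
    return pmf

def entropy(pmf, coords):
    """Entropy in bits of the marginal of pmf on the given coordinate indices."""
    marginal = {}
    for key, p in pmf.items():
        sub = tuple(key[i] for i in coords)
        marginal[sub] = marginal.get(sub, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def objective(pmf):
    """I(X;A) + I(X;U|Y,A) for a joint pmf keyed by (x, a, u, y)."""
    X, A, U, Y = 0, 1, 2, 3
    i_xa = entropy(pmf, (X,)) + entropy(pmf, (A,)) - entropy(pmf, (X, A))
    i_xu_given_ya = (entropy(pmf, (X, Y, A)) + entropy(pmf, (U, Y, A))
                     - entropy(pmf, (X, U, Y, A)) - entropy(pmf, (Y, A)))
    return i_xa + i_xu_given_ya

# Illustrative instance (assumed for this sketch): fair binary X, U = X,
# action 0 -> Z-channel, action 1 -> S-channel, both with parameter delta.
delta, alpha = 0.5, 0.4
p_x = {0: 0.5, 1: 0.5}
p_au_given_x = {0: {(0, 0): 1 - alpha, (1, 0): alpha},
                1: {(1, 1): 1 - alpha, (0, 1): alpha}}
p_y_given_xa = {(0, 0): {0: 1.0},
                (1, 0): {0: delta, 1: 1 - delta},
                (0, 1): {0: 1 - delta, 1: delta},
                (1, 1): {1: 1.0}}
value = objective(joint_pmf(p_x, p_au_given_x, p_y_given_xa))
print(round(value, 6))
```

With $U = X$ the second term reduces to $H(X|Y,A)$, so the printed value should agree with the lossless rate of about 0.678072 reported for this example in Section II-D.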



Remark: Write $R_{WZ}(P_X, P_{Y|X}, D)$ for the explicit dependence of the Wyner-Ziv rate distortion function [15] on the
distribution of the source and the conditional distribution of the side information given the source. It is clear that

$$R(D, C) \le \min \left\{ \sum_a P_A(a) R_{WZ}(P_X, P_{Y|X,A=a}, D_a) : \sum_a P_A(a) D_a \le D, \ \sum_a P_A(a) \Lambda(a) \le C \right\}, \tag{8}$$

since the right hand side can be achieved by letting the decoder take actions according to a pre-specified sequence with the
symbol $a$ a fraction $P_A(a)$ of the time, and performing Wyner-Ziv coding at distortion level $D_a$ separately on each subsequence
associated with each action symbol. It is natural to wonder whether the inequality in (8) can be strict. We will see through
some examples below that, in general, it may very well be strict. Indeed, even in the absence of a cost constraint, we give
examples showing that greedily selecting the action associated with the side information which is best in the Wyner-Ziv
sense, that is, the action $a$ minimizing $R_{WZ}(P_X, P_{Y|X,A=a}, D)$, may be suboptimal.

The following lemma will be useful in proving Theorem 1.

Lemma 1: Properties of the expressions defining $R^{(I)}(D, C)$:

1) For any fixed $P_X$ and $P_{Y|A,X}$, the set of distributions of the form given in (5) is a convex set in $P_{A,U|X}$.

2) For any fixed $P_X$ and $P_{Y|A,X}$, the expression $I(X; A) + I(X; U | Y, A)$ is convex in $P_{A,U|X}$ (assuming the joint
distribution given in (5)).

3) To exhaust $R^{(I)}(D, C)$, it is enough to restrict the alphabet of $U$ to satisfy

$$|\mathcal{U}| \le |\mathcal{X}||\mathcal{A}| + 2. \tag{9}$$

4) It suffices to restrict the minimization in (4) to joint distributions where $A$ is a deterministic function of $U$, i.e., of
the form

$$P_X(x) P_{U|X}(u | x) \mathbb{1}_{\{a = f(u)\}} P_{Y|X,A}(y | x, a). \tag{10}$$

Proof:

1) Since the set of conditional distributions $P_{A,U|X}$ is a convex set, and since $P_X$ and $P_{Y|X,A}$ are fixed, the set of
distributions $P_{X,A,U,Y}$ of the form given in (5) is a convex set. ■

2) Using the definition of mutual information we have the identity

$$I(X; A) + I(X; U | Y, A) = I(X; U, Y, A) + H(Y | A, X) - H(Y | A). \tag{11}$$

We show now that the right-hand side of (11) is convex in $P_{U,A|X}$ for fixed $P_X$ and $P_{Y|A,X}$. The expression
$I(X; U, Y, A)$ is convex in $P_{U,Y,A|X}$, hence it is also convex in $P_{U,A|X}$. For fixed $P_X$ and $P_{Y|A,X}$, the expression
$H(Y | A, X)$ is linear in $P_{A|X}$. Finally, we show that $-H(Y | A)$ is convex using the log sum inequality, which states
that for non-negative numbers $a_1, a_2$ and $b_1, b_2$,

$$a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2} \ge (a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2}. \tag{12}$$

Now let $P^3_{A|X} = \alpha P^1_{A|X} + \bar{\alpha} P^2_{A|X}$, where $0 \le \alpha \le 1$ and $\bar{\alpha} = 1 - \alpha$. Let us denote by $P^i_{Y,A,X}$ and $H^i(Y | A)$ the joint
distribution and the conditional entropy induced by $P^i_{A|X}$ and the fixed pmfs $P_X$ and $P_{Y|A,X}$, for $i = 1, 2, 3$. Consider

\begin{align*}
P^3_{Y,A}(y, a) \log \frac{P^3_{Y,A}(y, a)}{P^3_A(a)}
&\overset{(a)}{=} \left(\alpha P^1_{Y,A}(y, a) + \bar{\alpha} P^2_{Y,A}(y, a)\right) \log \frac{\alpha P^1_{Y,A}(y, a) + \bar{\alpha} P^2_{Y,A}(y, a)}{\alpha P^1_A(a) + \bar{\alpha} P^2_A(a)} \\
&\overset{(b)}{\le} \alpha P^1_{Y,A}(y, a) \log \frac{P^1_{Y,A}(y, a)}{P^1_A(a)} + \bar{\alpha} P^2_{Y,A}(y, a) \log \frac{P^2_{Y,A}(y, a)}{P^2_A(a)}, \tag{13}
\end{align*}

where (a) follows from the definition of $P^3_{Y,A,X}$ and (b) follows from the log sum inequality. Since (13) holds for any
$a \in \mathcal{A}$ and any $y \in \mathcal{Y}$, summing over $(y, a)$ we obtain that $-H(Y | A)$ is convex, i.e.,

$$-H^3(Y | A) \le -\alpha H^1(Y | A) - \bar{\alpha} H^2(Y | A). \tag{14}$$

■

3) We invoke the support lemma [4]. The external random variable $U$ must have $|\mathcal{X}||\mathcal{A}| - 1$ letters to preserve $P_{X,A}$,
plus three more to preserve the distortion constraint, the cost constraint and $I(A; X) + I(X; U | Y, A)$. This results in
an alphabet of size $|\mathcal{X}||\mathcal{A}| + 2$. ■

4) Note that it suffices to restrict the minimization in (4) to joint distributions where $A$ is a deterministic function of $U$,
i.e., of the form

$$P_X(x) P_{U|X}(u | x) \mathbb{1}_{\{a = f(u)\}} P_{Y|X,A}(y | x, a) \tag{15}$$

in lieu of (5). To see the equivalence note that a distribution of the form in (5) assumes the form in (10) by taking
$(U, A)$ as the auxiliary variable. ■
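For completeness, identity (11) used in item 2 of the proof can be checked in two lines. The following short derivation is added here as a worked step and uses only the chain rule for mutual information and the definition of conditional mutual information:

```latex
\begin{align*}
I(X; U, Y, A) &= I(X; A) + I(X; Y \mid A) + I(X; U \mid Y, A) \\
              &= I(X; A) + H(Y \mid A) - H(Y \mid A, X) + I(X; U \mid Y, A),
\end{align*}
% Moving H(Y|A) - H(Y|A,X) to the left-hand side gives
% I(X;A) + I(X;U|Y,A) = I(X;U,Y,A) + H(Y|A,X) - H(Y|A), which is (11).
```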

Proof of Theorem 1:

Achievability: We briefly and informally outline the achievability part, which is based on standard arguments: A codebook
of size $2^{n(I(X;A)+\epsilon)}$ is generated with codewords that are i.i.d. $\sim P_A$. For each such codeword, generate $2^{n(I(X;U|A)+\epsilon)}$
codewords according to $P_{U|A}$. Distribute these codewords uniformly at random into $2^{n(I(X;U|Y,A)+2\epsilon)}$ bins. Given the source
realization, $n(I(X; A) + \epsilon)$ bits are used by the encoder to communicate the identity of a codeword from the first codebook
jointly typical with it (with high probability there is at least one such codeword). The decoder now performs the actions
according to the action sequence conveyed to it. The encoder now uses an additional $n(I(X; U | Y, A) + 2\epsilon)$ bits
to describe the bin index of the codeword from the second codebook which is jointly typical with the source and the first
codeword. With high probability there is at least one such codeword (since more than $2^{nI(X;U|A)}$ such were generated), and
it is the only codeword in its bin which is jointly typical with the first codeword (which the decoder already knows) and
the side information sequence that has been generated and is observed at the decoder, since the size of each bin is no larger
than $2^{n(I(X;U|A) - I(X;U|Y,A) - \epsilon)} = 2^{n(I(Y;U|A) - \epsilon)}$. For the reconstruction, the decoder now employs the mapping $\hat{X}^{opt}$
in a symbol-by-symbol fashion on the components of the pair consisting of the second codeword and the side information
sequence.
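The bin-size exponent in the achievability outline relies on a standard identity that is worth spelling out. The following worked step is an added sketch; it uses only the Markov relation $U - (X, A) - Y$ implied by the factorization (5):

```latex
\begin{align*}
I(X; U \mid A) - I(X; U \mid Y, A)
  &= \bigl[ H(U \mid A) - H(U \mid X, A) \bigr] - \bigl[ H(U \mid Y, A) - H(U \mid X, Y, A) \bigr] \\
  &= H(U \mid A) - H(U \mid Y, A)
     && \text{since } H(U \mid X, Y, A) = H(U \mid X, A) \\
  &= I(Y; U \mid A).
\end{align*}
% The codeword count 2^{n(I(X;U|A)+\epsilon)} divided by the bin count
% 2^{n(I(X;U|Y,A)+2\epsilon)} therefore gives bins of size 2^{n(I(Y;U|A)-\epsilon)},
% and the total rate is (I(X;A)+\epsilon) + (I(X;U|Y,A)+2\epsilon).
```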

Converse: For the converse part, fix a scheme of rate at most $R$ for a block of length $n$ and consider:

\begin{align*}
nR &\ge H(T) \\
&= H(T, A^n) \\
&= H(A^n) + H(T | A^n) \\
&\ge H(A^n) - H(A^n | X^n) + H(T | A^n, Y^n) - H(T | Y^n, A^n, X^n) \\
&= I(X^n; A^n) + I(X^n; T | A^n, Y^n) \\
&= I(X^n; A^n) + H(X^n | A^n, Y^n) - H(X^n | A^n, Y^n, T). \tag{16}
\end{align*}

Now

\begin{align*}
I(X^n; A^n) + H(X^n | A^n, Y^n) &= H(X^n) - H(X^n | A^n) + H(X^n, Y^n | A^n) - H(Y^n | A^n) \\
&= H(X^n) - H(X^n | A^n) + H(X^n | A^n) + H(Y^n | A^n, X^n) - H(Y^n | A^n) \\
&= H(X^n) + H(Y^n | A^n, X^n) - H(Y^n | A^n) \\
&= \sum_{i=1}^{n} H(X_i) + H(Y_i | X_i, A_i) - H(Y_i | Y^{i-1}, A^n) \\
&\ge \sum_{i=1}^{n} H(X_i) - I(Y_i; X_i | A_i) \tag{17} \\
&= \sum_{i=1}^{n} I(X_i; A_i) + H(X_i | Y_i, A_i). \tag{18}
\end{align*}

Combining (16) and (18) yields

\begin{align*}
nR &\ge \sum_{i=1}^{n} I(X_i; A_i) + H(X_i | Y_i, A_i) - H(X_i | X^{i-1}, A^n, Y^n, T) \\
&\overset{(a)}{=} \sum_{i=1}^{n} I(X_i; A_i) + H(X_i | Y_i, A_i) - H(X_i | Y_i, A_i, U_i) \\
&= \sum_{i=1}^{n} I(X_i; A_i) + I(X_i; U_i | Y_i, A_i), \tag{19}
\end{align*}

where (a) follows by taking $U_i = (A^{n \setminus i}, Y^{n \setminus i}, X^{i-1}, T)$. Noting that $\hat{X}_i = \hat{X}_i(T, Y^n)$ is a function of the pair $(U_i, Y_i)$,
and the Markov relation $U_i - (A_i, X_i) - Y_i$, the proof is now completed in the standard way upon considering the joint
distribution of $(X', A', U', Y', \hat{X}') = (X_J, A_J, U_J, Y_J, \hat{X}_J)$, where $J$ is generated uniformly at random from the
set $\{1, \ldots, n\}$, independent of $(X^n, A^n, U^n, Y^n, \hat{X}^n)$, and noting that:

\begin{align*}
& P_{X'} = P_X, \quad U' - (A', X') - Y', \quad P_{Y'|X',A'} = P_{Y|X,A}, \tag{20} \\
& \hat{X}' = \hat{X}'(U', Y'), \tag{21} \\
& E\left[\sum_{i=1}^{n} \rho(X_i, \hat{X}_i)\right] = n E \rho(X', \hat{X}'), \quad E\left[\sum_{i=1}^{n} \Lambda(A_i)\right] = n E \Lambda(A'), \tag{22}
\end{align*}

and

$$\frac{1}{n} \sum_{i=1}^{n} I(X_i; A_i) + I(X_i; U_i | Y_i, A_i) \ge I(X'; A') + I(X'; U' | Y', A'), \tag{23}$$

where the last inequality follows from item 2 in Lemma 1, which states that $I(X; A) + I(X; U | Y, A)$ is convex over the set of
distributions that satisfy (20). ■



It is natural to wonder whether the characterization above remains valid when the choice of the actions is allowed to
depend on the side information symbols generated thus far, that is, for the $i$th action to be of the form $A_i = A_i(T, Y^{i-1})$.
The converse in the proof above does not carry over to this case since the inequality $H(Y^n | A^n, X^n) \ge \sum_{i=1}^{n} H(Y_i | A_i, X_i)$,
used in (17), may no longer hold. Whether the best achievable rate could, in general, be better (less) when allowing such
schemes remains open.

C. Actions taken by the decoder before the index is seen 

Consider the setting as in Figure 1, where the actions $A^n$ are taken by the decoder before the index $T$ is seen. In such
a case $A^n$ is independent of $X^n$. For this case, the rate distortion cost function is similar to $R^{(I)}(D, C)$ defined in the
previous section, but with an additional constraint that $A$ is independent of $X$. Define

$$R^{(I)}_{A \perp X}(D, C) = \min I(X; U | Y, A), \tag{24}$$

where the joint distribution of $X, A, Y, U$ is of the form

$$P_{X,A,U,Y}(x, a, u, y) = P_X(x) P_A(a) P_{U|X,A}(u | x, a) P_{Y|X,A}(y | x, a), \tag{25}$$

and the minimization is over all $P_A$ and $P_{U|X,A}$ under which

$$E\left[\rho\left(X, \hat{X}^{opt}(U, Y)\right)\right] \le D, \qquad E[\Lambda(A)] \le C, \tag{26}$$

where $\hat{X}^{opt}(U, Y)$ denotes the best estimate of $X$ based on $U, Y$, and where $U$ is an auxiliary random variable with cardinality
$|\mathcal{U}| \le |\mathcal{X}||\mathcal{A}| + 2$.

Theorem 2: The rate distortion cost function for the setting where actions are taken by the decoder before the index is seen
is given by $R^{(I)}_{A \perp X}(D, C)$.

Proof: The proof is similar to the proof of Theorem 1, but taking into account that $A^n$ is independent of $X^n$, and
therefore $A_i$ is independent of $X_i$. ■

If the cost is unlimited, then the greedy policy is optimal, namely the decoder blindly chooses the action $a$ minimizing

$$R_{WZ}(P_X, P_{Y|X,A=a}, D), \tag{27}$$

and an optimal Wyner-Ziv code for the source $P_X$ and channel $P_{Y|X,A=a}$ is employed. For the more general case, in the
presence of a cost constraint, as can be expected and is straightforward to check, $R^{(I)}_{A \perp X}(D, C)$ in (24) coincides with the
minimum on the right hand side of (8).

D. Examples 

1) The Lossless Case: As a very special case of Theorem 1 we get that, in the absence of a cost constraint on the actions,
the minimum rate needed for a near lossless reconstruction at the decoder is given by

$$\min \left[ I(X; A) + H(X | Y, A) \right], \tag{28}$$

where the joint distribution of $X, A, Y$ in (28) is of the form

$$P_{X,A,Y}(x, a, y) = P_X(x) P_{A|X}(a | x) P_{Y|X,A}(y | x, a), \tag{29}$$

and the minimization is over all $P_{A|X}$. Letting $R_{SW}(P_X, P_{Y|X})$ denote the conditional entropy $H(X | Y)$ induced by the
pair $(P_X, P_{Y|X})$ (the subscript SW standing for 'Slepian-Wolf' [9]), it is natural to wonder whether the above minimum rate
can be strictly better (smaller) than $\min_a R_{SW}(P_X, P_{Y|X,A=a})$, which is what would be achieved if the decoder greedily
takes the one action leading to S.I. which is best in the sense of inducing the lowest $H(X | Y)$, irrespective of any information
from the encoder, and then proceeds as in Slepian-Wolf coding. The following is an example showing that this greedy
strategy may be suboptimal.
Fig. 3. An example of vending side information, where the action chooses between the Z-channel and the S-channel with parameter $\delta$.



Consider the case $\mathcal{X} = \mathcal{A} = \mathcal{Y} = \{0, 1\}$ where $X$ is a fair coin flip, $P_{Y|X,A=0}$ is the Z-channel with crossover probability
$\delta$ from 1 to 0, and $P_{Y|X,A=1}$ is the S-channel with crossover probability $\delta$ from 0 to 1. The setting is depicted in Figure 3.
Symmetry implies that the $P_{A|X}$ minimizing $I(X; A) + H(X | Y, A)$ satisfies $P_{A|X}(0|1) = P_{A|X}(1|0)$, in other words, there
is a BSC connecting $X$ to $A$ (or $A$ to $X$). Assuming this BSC has crossover probability $\alpha$, an elementary calculation yields

$$I(X; A) + H(X | Y, A) = 1 - h(\alpha) + h\left(\frac{\alpha\delta}{1 - \alpha + \alpha\delta}\right)(1 - \alpha + \alpha\delta). \tag{30}$$

Thus, letting $R_{\min}(\delta)$ denote the minimum in (28) for this scenario,

$$R_{\min}(\delta) = \min_{\alpha \in [0,1]} \left[ 1 - h(\alpha) + h\left(\frac{\alpha\delta}{1 - \alpha + \alpha\delta}\right)(1 - \alpha + \alpha\delta) \right]. \tag{31}$$

In contrast, the minimum rate achieved by a 'greedy' strategy which chooses actions without regard to the information from
the encoder is given by the conditional entropy of the input given the output of the Z-channel($\delta$) whose input is a fair coin
flip, namely

$$R_{greedy}(\delta) = h\left(\frac{1}{1 + \delta}\right)\frac{1 + \delta}{2}. \tag{32}$$

For example, elementary calculus shows that $R_{\min}(1/2)$ is achieved by $\alpha^* = 2/5$, assuming the value $\approx 0.678072$, which is
about a 1.5% improvement over $R_{greedy}(1/2) \approx 0.688722$. Figure 4 plots the difference between $R_{greedy}(\delta)$ and $R_{\min}(\delta)$.
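The numbers quoted for $\delta = 1/2$ can be reproduced with a few lines of code. The sketch below is an illustration only: it evaluates the expression $1 - h(\alpha) + (1 - \alpha + \alpha\delta)\, h(\alpha\delta / (1 - \alpha + \alpha\delta))$ for the symmetric (BSC) choice of $P_{A|X}$ and the greedy rate $\frac{1+\delta}{2} h(\frac{1}{1+\delta})$, and minimizes over $\alpha$ by a plain grid search. The function names and the grid resolution are assumptions of this sketch.

```python
from math import log2

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)

def rate(alpha, delta):
    """I(X;A) + H(X|Y,A) when a BSC(alpha) connects X to A."""
    w = 1 - alpha + alpha * delta          # P(Y = uninformative output | A)
    return 1 - h(alpha) + (w * h(alpha * delta / w) if w > 0 else 0.0)

def r_greedy(delta):
    """H(X|Y) for a fair coin through the Z-channel(delta)."""
    return h(1 / (1 + delta)) * (1 + delta) / 2

def r_min(delta, steps=10000):
    """Grid-search minimum of rate over alpha; returns (value, argmin)."""
    return min((rate(k / steps, delta), k / steps) for k in range(steps + 1))

value, alpha_star = r_min(0.5)
gain = (r_greedy(0.5) - value) / r_greedy(0.5)
print(alpha_star, round(value, 6), round(r_greedy(0.5), 6), round(100 * gain, 2))
```

Setting $\alpha = 1/2$ in `rate` makes $A$ independent of $X$ and recovers the greedy rate, which is a quick consistency check between the two formulas.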
In the presence of a cost constraint, Theorem 1 implies that the minimum rate needed for a near lossless reconstruction
is given by the minimum in (28), with the additional constraint $E\Lambda(A) \le C$. Let $R_{\min}(\delta, C)$ denote this minimum
for our present example, assuming cost 0 for using, say, the first channel (the Z-channel) and 1 for using the second channel. Clearly
$R_{\min}(\delta, 0) = R_{greedy}(\delta)$, $R_{\min}(\delta, 1/2) = R_{\min}(\delta)$ and consequently, by a time-sharing argument,

$$R_{\min}(\delta, C) \le 2C R_{\min}(\delta) + (1 - 2C) R_{greedy}(\delta), \qquad 0 \le C \le 1/2. \tag{33}$$

As it turns out, the inequality in (33) is strict, i.e., in our example one can do better than time-sharing between the respective
optimum schemes for the different costs (to the level allowed by the cost constraint). Figure 5 contains a plot of $R_{\min}(1/2, C)$,
which is seen to be better (lower) than the straight line represented by the right side of (33).
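The gap to time-sharing can be observed numerically. The following sketch is an illustration under stated assumptions: it parametrizes $P_{A|X}$ by `p0` $= P(A=1|X=0)$ and `p1` $= P(A=1|X=1)$, takes cost 0 for the Z-channel (action 0) and cost 1 for the S-channel (action 1), and minimizes $I(X;A) + H(X|Y,A)$ subject to $P(A=1) \le C$ by a brute-force grid search. The function names and the grid resolution are hypothetical choices, so the result is only an upper bound on $R_{\min}(\delta, C)$ up to grid accuracy.

```python
from math import log2

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)

def residual(q, delta):
    """H(X|Y, A=a) when the input the channel may flip has conditional probability q."""
    w = 1 - q + q * delta
    return w * h(q * delta / w) if w > 0 else 0.0

def lossless_rate(p0, p1, delta):
    """I(X;A) + H(X|Y,A) for fair X, P(A=1|X=0)=p0, P(A=1|X=1)=p1."""
    pa1 = (p0 + p1) / 2
    pa0 = 1 - pa1
    total = 1.0                                  # H(X) of a fair coin
    if pa0 > 0:
        q0 = (1 - p1) / 2 / pa0                  # P(X=1|A=0): flipped by the Z-channel
        total += pa0 * (residual(q0, delta) - h(q0))
    if pa1 > 0:
        q1 = p0 / 2 / pa1                        # P(X=0|A=1): flipped by the S-channel
        total += pa1 * (residual(q1, delta) - h(q1))
    return total

def r_min_cost(delta, cost, steps=200):
    """Grid-search minimum subject to P(A=1) = (p0 + p1)/2 <= cost."""
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            if i + j <= 2 * cost * steps + 1e-9:
                best = min(best, lossless_rate(i / steps, j / steps, delta))
    return best

delta, cost = 0.5, 0.25
time_sharing = 2 * cost * r_min_cost(delta, 0.5) + (1 - 2 * cost) * r_min_cost(delta, 0.0)
print(round(r_min_cost(delta, cost), 6), round(time_sharing, 6))
```

For $\delta = 1/2$ and $C = 1/4$ the grid search should return a value visibly below the time-sharing line, in line with the strict inequality stated above.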




Fig. 5. Plot of $R_{\min}(1/2, C)$ as a function of the cost $C$.



2) The Lossy Case: Ternary Source and Binary Side Information of Unit Cost: Consider a ternary $X$ taking values in
$\{-1, 0, 1\}$, distributed according to

$$X = \begin{cases} 1 & \text{w.p. } 1/4 \\ 0 & \text{w.p. } 1/2 \\ -1 & \text{w.p. } 1/4. \end{cases} \tag{34}$$

The actions are binary, taking values in $\{0, 1\}$, where action 0 corresponds to no S.I. while action 1 corresponds to
obtaining a binary noisy measurement of $X$, taking values in $\{-1, 1\}$, which is the output of the following channel:
$P_{Y|X}(1|1) = P_{Y|X}(-1|-1) = 1$ and $P_{Y|X}(1|0) = P_{Y|X}(-1|0) = 1/2$. Suppose that there is a unit cost for obtaining such
a noisy measurement of the source, i.e., $\Lambda(a) = a$, $a \in \{0, 1\}$.

Fig. 6. Ternary example.

The conditional entropy of $X$ given $Y$ is 1 bit. Thus, lossless compression of $X$ is achievable at a rate of 1 bit per source
symbol at a cost of 1 per source symbol with a greedy decoder who chooses to observe the noisy source measurement of
all symbols. Can one do better than this greedy policy? This rate is achievable at half the cost via the following scheme: the
encoder uses one bit per source symbol to describe whether or not the symbol is 0. The decoder then needs to use the noisy
measurement of the source only for those symbols that are not 0 (in which case the measurement will completely determine
the source symbol). This corresponds to rate $I(X; A) + H(X | Y, A)$ under $P_{A|X}(1|1) = P_{A|X}(1|-1) = P_{A|X}(0|0) = 1$,
which is readily verified to be the minimum of achievable rates under a cost constraint of 1/2.
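The half-cost lossless scheme just described is simple enough to simulate directly. The sketch below is a toy illustration, not the general coding scheme: the function name `simulate`, the use of `None` for a missing measurement, and the uncoded one-bit-per-symbol index are assumptions made for this example.

```python
import random

def simulate(n, seed=0):
    """Run the one-bit-per-symbol scheme on n i.i.d. ternary source symbols."""
    rng = random.Random(seed)
    # Source (34): P(1) = P(-1) = 1/4, P(0) = 1/2.
    x = rng.choices([1, 0, -1], weights=[1, 2, 1], k=n)
    # Encoder: one bit per symbol saying whether the symbol is 0.
    index = [int(xi == 0) for xi in x]
    # Decoder actions: observe (A = 1) only the symbols flagged as nonzero.
    actions = [1 - b for b in index]
    # Side information channel: noiseless for x = +/-1, uniform on {-1, 1} for x = 0.
    y = []
    for xi, ai in zip(x, actions):
        if ai == 0:
            y.append(None)                      # action 0: no side information
        elif xi == 0:
            y.append(rng.choice([-1, 1]))       # not reached under this scheme
        else:
            y.append(xi)
    # Reconstruction: 0 where flagged, otherwise the measurement itself.
    x_hat = [0 if b else yi for b, yi in zip(index, y)]
    cost = sum(actions) / n                     # empirical cost per symbol
    return x, x_hat, cost

x, x_hat, cost = simulate(10000)
print(x == x_hat, cost)
```

The reconstruction is exact by construction, and the empirical cost concentrates around $P(X \ne 0) = 1/2$, matching the rate of 1 bit per symbol at half the cost of the greedy policy.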

In the lossy case, under Hamming distortion, we note that:

• When the S.I. is available to both encoder and decoder (at no cost) the problem is reduced to one of lossy compression
for the binary symmetric source, thus $R_{X|Y}(D) = 1 - h(D)$.

• This rate is achievable even when the S.I. is absent at the encoder, as can be seen by letting $W$ be the output of a
BSC($D$) whose input $Q(X)$ is the quantized version of $X$, defined by $Q(0) = 0$ and $Q(1) = Q(-1) = 1$, where
$W - X - Y$. It is readily seen that the optimal estimate of $X$ based on $(W, Y)$ satisfies $P(X \ne \hat{X}(W, Y)) = D$ and
that $I(X; W | Y) = 1 - h(D)$. Thus $R^{WZ}_{X|Y}(D) = R_{X|Y}(D) = 1 - h(D)$.

• $R^{WZ}_{X|Y}(D)$ in the above item corresponds to a decoder that observes all of the S.I. symbols. Can the same performance
be achieved with fewer observations? In other words, assuming unit cost per observation, can the same performance
be achieved at a cost less than 1? We now argue that the same performance can be achieved at half the cost: letting,
as before, $A = 0$ correspond to no observation and $A = 1$ correspond to an observation, consider a conditional
distribution $P_{A|X}$ given by $P_{A|X}(0|1) = P_{A|X}(0|-1) = P_{A|X}(1|0) = D$ and where $U = A$. Then $I(X; A) +
I(X; U | Y, A) = I(X; A) = H(A) - H(A | X) = 1 - h(D)$ and the optimal estimate of $X$ based on $(U, Y)$ has
$P(X \ne \hat{X}(U, Y)) = D$. The cost here is $P(A = 1) = 1/2$. Evidently, the rate-distortion-cost function $R(D, C)$ in (3)
satisfies $R(D, 1/2) \le 1 - h(D)$ and in fact $R(D, 1/2) = 1 - h(D)$ since obviously $R(D, 1/2) \ge R_{X|Y}(D)$. Thus the
rate $1 - h(D)$ is achievable even if the decoder is allowed to access only half of the observations.
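The rate, distortion, and cost of the half-cost lossy scheme in the last item can be checked by direct enumeration. The sketch below is illustrative and rests on explicit assumptions: the action is taken with $P(A=1|X=\pm 1) = 1 - D$ and $P(A=1|X=0) = D$, the auxiliary variable is $U = A$, and the decoder outputs $Y$ when it observes and $0$ otherwise (which is the most likely value of $X$ given $(A, Y)$ when $D < 1/2$). The function name `lossy_scheme` is hypothetical.

```python
from math import log2

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)

def lossy_scheme(D):
    """Return (rate, distortion, cost) of the observe-half scheme for the ternary source."""
    p_x = {1: 0.25, 0: 0.5, -1: 0.25}
    p_a1 = {1: 1 - D, 0: D, -1: 1 - D}          # assumed P(A = 1 | X = x)
    cost = sum(p_x[x] * p_a1[x] for x in p_x)    # P(A = 1), unit cost per observation
    # Rate: I(X;A) = H(A) - H(A|X); the term I(X;U|Y,A) vanishes because U = A.
    rate = h(cost) - sum(p_x[x] * h(p_a1[x]) for x in p_x)
    # Errors: X = 0 but observed (estimate Y is nonzero), or X != 0 but not observed (estimate 0).
    distortion = p_x[0] * p_a1[0] + sum(p_x[x] * (1 - p_a1[x]) for x in (1, -1))
    return rate, distortion, cost

rate, distortion, cost = lossy_scheme(0.1)
print(round(rate, 6), round(distortion, 6), round(cost, 6))
```

For any $0 \le D < 1/2$ the three returned values should equal $1 - h(D)$, $D$, and $1/2$, respectively.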
3) Binary Action: To Observe or Not to Observe the S.I.: Consider a given source and side information distribution $P_{X,Y}$.
The action is to either observe the side information symbol or not, where an observation has unit cost. Thus $0 \le C \le 1$
is a constraint on the fraction of side information symbols the decoder will be allowed to observe. Let us arbitrarily take
$\mathcal{A} = \{0, 1\}$, with $A = 1$ corresponding to observation of the side-information symbol and $A = 0$ to lack of it. Noting that
the second mutual information term in (4) corresponds to Wyner-Ziv coding conditional on $A$, the specialization of Theorem
1 for this case gives

$$R(D, C) = \min_{P_{A|X} :\, A - X - Y,\ P(A=1) = C,\ (1-C) D_0 + C D_1 = D} I(X; A) + R(P_{X|A=0}, D_0) \cdot P(A = 0) + R_{WZ}(P_{X,Y|A=1}, D_1) \cdot P(A = 1), \tag{35}$$

where $R(P_X, D)$ denotes the rate distortion function of the source $P_X$ and $R_{WZ}(P_{X,Y}, D)$ denotes the Wyner-Ziv rate
distortion function when source and side information are distributed according to $P_{X,Y}$.

A very special case is when $Y = X$. Thus the action is either to observe the source symbol or not. Assuming a non-negative
distortion measure satisfying $\min_{\hat{x}} \rho(x, \hat{x}) = 0$ for all $x$, (35) becomes

$$R(D, C) = \min_{P_{A|X} :\, P(A=1) = C} I(X; A) + R\left(P_{X|A=0}, \frac{D}{1 - C}\right) \cdot (1 - C). \tag{36}$$

When $X$ is a fair coin flip and distortion is Hamming, (36) becomes (for $D$, $C$ in the non-trivial region)

\begin{align*}
R(D, C) &= \min_{P_{A|X} :\, P(A=1) = C} I(X; A) + R_b\left(P_{X=1|A=0}, \frac{D}{1 - C}\right) \cdot (1 - C) \\
&= \min_{P_{A|X} :\, P(A=1) = C} 1 - \left[h_b(P_{X=1|A=1}) C + h_b(P_{X=1|A=0})(1 - C)\right] + \left[h_b(P_{X=1|A=0}) - h_b\left(\frac{D}{1 - C}\right)\right] \cdot (1 - C) \\
&\overset{(a)}{=} 1 - C - h_b\left(\frac{D}{1 - C}\right) \cdot (1 - C) \\
&= R^{(I)}_{A \perp X}(D, C), \tag{37}
\end{align*}

where $R_b(p, D) = [h_b(p) - h_b(D)]^+$ is the rate distortion function of the Bernoulli($p$) source and step (a) is due to the fact
that $-h_b(P_{X=1|A=1})$ is minimized (at the value $-1$) by taking $A$ independent of $X$.
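The closed form $1 - C - (1 - C)\, h_b\!\left(\frac{D}{1-C}\right)$ obtained above is easy to tabulate. The sketch below is an illustration only; the function name `r_indep` is hypothetical, and returning 0 once $D/(1-C) \ge 1/2$ (that is, outside the non-trivial region, where the unobserved symbols can be guessed within the allowed distortion) is an assumption of this sketch rather than a statement from the text.

```python
from math import log2

def h_b(p):
    """Binary entropy in bits, with h_b(0) = h_b(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)

def r_indep(D, C):
    """1 - C - (1 - C) h_b(D / (1 - C)) for a fair coin, Y = X, Hamming distortion."""
    if C >= 1.0 or D / (1 - C) >= 0.5:
        return 0.0                               # assumed value outside the non-trivial region
    return (1 - C) * (1 - h_b(D / (1 - C)))

for C in (0.0, 0.25, 0.5):
    print(C, round(r_indep(0.1, C), 6))
```

At $C = 0$ the expression reduces to the rate distortion function $1 - h_b(D)$ of the fair coin, and it decreases as the observation budget $C$ grows.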



To see that $R(D, C)$ can be strictly smaller than $R^{(I)}_{A \perp X}(D, C)$ in the observe/not-observe binary action scenario, consider
the case where $X$ is a fair coin flip and $Y$ is the output of an erasure channel with erasure probability $e$ (whose input is
$X$). Recalling that $R_{WZ}(P_{X,Y}, D) = e R(P_X, D/e)$ when $Y$ is the erased version of $X$ (cf. [8], [10]), we specialize the
right hand side of (35) for this case to obtain

\begin{align*}
R(D, C) &= \min_{P_{A|X} :\, A - X - Y,\ P(A=1) = C,\ (1-C) D_0 + C D_1 = D} 1 - H(X | A) + R_b(P_{X|A=0}(1), D_0) \cdot (1 - C) + e R_b(P_{X|A=1}(1), D_1/e) \cdot C \\
&= \min\ 1 - \left[h_b\left(\frac{\beta}{2(1 - C)}\right)(1 - C) + h_b\left(\frac{1 - \beta}{2C}\right) C\right] + R_b\left(\frac{\beta}{2(1 - C)}, \frac{D - C D_1}{1 - C}\right)(1 - C) + e R_b\left(\frac{1 - \beta}{2C}, \frac{D_1}{e}\right) C, \tag{38}
\end{align*}

where the last minimum is over $\max\{0, 1 - 2C\} \le \beta \le \min\{1, 2 - 2C\}$ and $0 \le D_1 \le \min\{D/C, e\}$. For the extreme points
we get, as expected: $R(D, 0) = R_b(\frac{1}{2}, D)$ and $R(D, 1) = e R_b(\frac{1}{2}, D/e)$. Figure 7 plots the curve in (38) for $D = 1/4$,
$e = 1/2$ and $0 \le C \le 1$.
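The minimization over $(\beta, D_1)$ in (38) can be approximated by a grid search. The following is a rough sketch under stated assumptions: the names `r_b`, `objective`, and `r_erasure` are hypothetical, the search is restricted to $0 < C < 1$ to avoid the degenerate endpoints, and `r_b` returns 0 whenever the distortion is at least $\min(p, 1-p)$, which is the standard extension of $[h_b(p) - h_b(D)]^+$ to large distortions.

```python
from math import log2

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)

def r_b(p, d):
    """Rate distortion function of a Bernoulli(p) source under Hamming distortion."""
    return 0.0 if d >= min(p, 1 - p) else h(p) - h(d)

def objective(beta, d1, D, C, e):
    """Expression minimized in (38) for given beta and D_1."""
    p0 = beta / (2 * (1 - C))                    # P(X = 1 | A = 0)
    p1 = (1 - beta) / (2 * C)                    # P(X = 1 | A = 1)
    d0 = (D - C * d1) / (1 - C)                  # distortion left for unobserved symbols
    return (1 - (1 - C) * h(p0) - C * h(p1)
            + (1 - C) * r_b(p0, d0) + C * e * r_b(p1, d1 / e))

def r_erasure(D, C, e, steps=100):
    """Grid-search approximation of (38), valid for 0 < C < 1."""
    lo, hi = max(0.0, 1 - 2 * C), min(1.0, 2 - 2 * C)
    d1_max = min(D / C, e)
    best = float("inf")
    for i in range(steps + 1):
        beta = lo + (hi - lo) * i / steps
        for j in range(steps + 1):
            best = min(best, objective(beta, d1_max * j / steps, D, C, e))
    return best

D, e = 0.25, 0.5
chord = 0.5 * r_b(0.5, D) + 0.5 * e * r_b(0.5, D / e)   # time-sharing value at C = 1/2
print(round(r_erasure(D, 0.5, e), 6), round(chord, 6))
```

For $D = 1/4$, $e = 1/2$, and $C = 1/2$ the grid value should fall clearly below the midpoint of the two extreme rates $R(D, 0)$ and $R(D, 1)$, which is what time-sharing between the extreme schemes would give.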




Fig. 7. Rate distortion cost function $R(D, C)$, $0 \le C \le 1$, for the case where $X$ is a fair coin flip, $Y$ its erased version where $e = 1/2$, $C$ is the fraction
of places where the decoder is allowed to observe S.I., and $D = 1/4$. In this case $R(D, 0) = R_b(\frac{1}{2}, \frac{1}{4}) \approx 0.188722$ and $R(D, 1) = e R_b(\frac{1}{2}, D/e) = 0$.
The strict convexity implies sub-optimality of time-sharing optimal schemes according to the available observation budget.



E. Causal Decoder Side Information 

Consider the setting presented in Figure 8, which is similar to that described in Section II-A, the only difference being that
the reconstruction is allowed causal dependence on the side information, i.e., to be of the form $\hat{X}_i = \hat{X}_i(T, Y^i)$ (motivation
for why this might be interesting can be found in [12]).

Define

$$R^{(I)}_{causal}(D, C) = \min I(X; U, A), \tag{39}$$

where the joint distribution of $X, A, Y, U$ is of the form

$$P_{X,A,U,Y}(x, a, u, y) = P_X(x) P_{A,U|X}(a, u | x) P_{Y|X,A}(y | x, a), \tag{40}$$
Fig. 8. Rate distortion with causal side information vender at the decoder.



and the minimization is over all $P_{A,U|X}$ under which

$$E\left[\rho\left(X, \hat{X}^{opt}(U, Y)\right)\right] \le D, \qquad E[\Lambda(A)] \le C, \tag{41}$$

where $\hat{X}^{opt}(U, Y)$ denotes the best estimate of $X$ based on $U, Y$, and where $U$ is an auxiliary random variable. The cardinality
of $\mathcal{U}$ may be restricted to $|\mathcal{U}| \le |\mathcal{X}||\mathcal{A}| + 2$ as shown in item 3 of Lemma 1. One can also denote $(U, A)$ as $\tilde{U}$, and an equivalent
representation would be

$$R^{(I)}_{causal}(D, C) = \min I(X; \tilde{U}), \tag{42}$$

where $P_{X,A,\tilde{U},Y}(x, a, \tilde{u}, y) = P_X(x) P_{\tilde{U}|X}(\tilde{u} | x) \mathbb{1}_{\{a = f(\tilde{u})\}} P_{Y|X,A}(y | x, a)$.

Theorem 3: The rate distortion cost function for the setting where the reconstruction is allowed causal dependence on the side information
is given by $R^{(I)}_{causal}(D, C)$.

Proof: Achievability: The achievability proof is based on the fact that the encoder can convey to the decoder an action sequence $a^n$ whose joint type with the source is close to
$P_{X,A}$ using a rate of $I(X; A) + \epsilon$, and since both the encoder and decoder know the sequence of actions $a^n$, they
can time-share between $|\mathcal{A}|$ causal schemes such that if the action is $a$ a rate $I(X; U | A = a) + \epsilon$ would achieve the distortion
constraint [12]. Hence, the total rate is $I(X; A) + \epsilon + \sum_a P_A(a)(I(X; U | A = a) + \epsilon) = I(X; A, U) + 2\epsilon$.

Converse: For the converse part, fix a scheme of rate $R$ for a block of length $n$ and consider:

\begin{align*}
nR &\ge H(T) \\
&\ge I(X^n; T) \\
&= \sum_{i=1}^{n} H(X_i) - H(X_i | X^{i-1}, T) \\
&\overset{(a)}{=} \sum_{i=1}^{n} H(X_i) - H(X_i | X^{i-1}, T, Y^{i-1}), \tag{43}
\end{align*}

where step (a) is due to the Markov chain $X_i - (X^{i-1}, T) - Y^{i-1}$. Now let us denote $U_i := (T, Y^{i-1})$. Since conditioning reduces entropy, we obtain that

$$R \ge \frac{1}{n} \sum_{i=1}^{n} I(X_i; U_i). \tag{44}$$

The proof is now completed in the standard way upon considering the joint distribution of $(X', A', U', Y', \hat{X}') =
(X_J, A_J, (U_J, J), Y_J, \hat{X}_J)$, where $J$ is generated uniformly at random from the set $\{1, \ldots, n\}$, independent of
$(X^n, A^n, U^n, Y^n, \hat{X}^n)$, and noting that:

\begin{align*}
& P_{X'} = P_X, \quad U' - (A', X') - Y', \quad P_{Y'|X',A'} = P_{Y|X,A}, \tag{45} \\
& \hat{X}' = \hat{X}'(U', Y'), \quad A' = f(U'), \tag{46} \\
& E\left[\sum_{i=1}^{n} \rho(X_i, \hat{X}_i)\right] = n E \rho(X', \hat{X}'), \quad E\left[\sum_{i=1}^{n} \Lambda(A_i)\right] = n E \Lambda(A'), \tag{47}
\end{align*}

and

$$\frac{1}{n} \sum_{i=1}^{n} I(X_i; U_i) = I(X'; U'). \tag{48}$$

F. Indirect Rate Distortion with Action-Dependent Side Information 



[Figure 9: block diagram. X^n passes through P_{Z|X} to give Z^n; the encoder sends T(Z^n) \in \{1, ..., 2^{nR}\}; the decoder issues A^n(T) to the vender, receives Y^n, and outputs \hat{X}^n(Y^n, T).]

Fig. 9. Indirect rate distortion with a side information vender at the decoder. The source X^n is i.i.d. ~ P_X and the encoder observes a noisy version of the source, Z^n, where the pairs (X_i, Z_i) are i.i.d. ~ P_{X,Z}. Side information is generated as the output of the channel P_{Y|X,Z,A} in response to the noise-free, noisy, and action sequences (X^n, Z^n, A^n), where the action sequence A^n is generated on the basis of the index from the encoder.



Consider the case shown in Figure 9 where, rather than the source X, the encoder observes a noisy version of it, Z. The decoder, based on the index conveyed to it from the encoder, will then select an action sequence that will result in the side information Y, as output from the channel P_{Y|X,Z,A}. The reconstruction, as before, will be a function of the index and the side information. Specifically, a scheme in this setting for blocklength n and rate R is characterized by an encoding function T : \mathcal{Z}^n \to \{1, 2, ..., 2^{nR}\}, an action strategy f : \{1, 2, ..., 2^{nR}\} \to \mathcal{A}^n, and a decoding function g : \{1, 2, ..., 2^{nR}\} \times \mathcal{Y}^n \to \hat{\mathcal{X}}^n that operate as follows:

• The source n-tuple X^n, which is i.i.d. ~ P_X, goes through a DMC P_{Z|X} to yield its noisy observation sequence Z^n. Thus, overall, the clean and noisy sources are characterized by a given joint distribution P_{X,Z}.

• Encoding: based on Z^n, give index T = T(Z^n) to the decoder

• Decoding:

- given the index, choose an action sequence A^n = f(T)

- the side information Y^n will be the output of the memoryless channel P_{Y|X,Z,A} whose input is (X^n, Z^n, A^n)

- let \hat{X}^n = g(T, Y^n)

The rate-distortion-cost function for this case is now defined similarly as in Subsection II-A. Let us denote it by R_{ID}(D,C), the subscript standing for 'indirect'. Theorem 1 is generalized to this case as follows:

Theorem 4: R_{ID}(D,C) is given by

R_{ID}(D,C) = \min [I(Z;A) + I(Z;U|Y,A)], (49)

where the joint distribution of X, Z, A, Y, U is of the form

P_{X,Z,A,U,Y}(x,z,a,u,y) = P_{X,Z}(x,z) P_{A,U|Z}(a,u|z) P_{Y|X,Z,A}(y|x,z,a), (50)

and the minimization is over all P_{A,U|Z} under which

E[\rho(X, \hat{X}^{opt}(U,Y))] \le D, E[\Lambda(A)] \le C, (51)

where \hat{X}^{opt}(U, Y) denotes the best estimate of X based on U, Y, and U is an auxiliary random variable whose cardinality is bounded as |\mathcal{U}| \le |\mathcal{Z}||\mathcal{A}| + 2.

Proof outline: The achievability part is very similar to the original. The random generation of the scheme is performed in the same way, with the noisy source replacing the original noise-free source. This guarantees that (Z^n, A^n, U^n, Y^n) are, with high probability, jointly typical. The joint typicality also with X^n, namely the joint typicality of (X^n, Z^n, A^n, U^n, Y^n), then follows from an application of the Markov lemma. The converse part also follows similarly to the one from the noise-free case: that

nR \ge I(Z^n; A^n) + H(Z^n | A^n, Y^n) - H(Z^n | A^n, Y^n, T) (52)

follows identically as in (16) by replacing X^n by Z^n. That

I(Z^n; A^n) + H(Z^n | A^n, Y^n) \ge \sum_{i=1}^n [I(Z_i; A_i) + H(Z_i | Y_i, A_i)] (53)

follows similarly as (18) by replacing X^n with Z^n, upon noting that H(Y^n | A^n, Z^n) = \sum_{i=1}^n H(Y_i | A_i, Z_i), which follows from the Markov relation (X_i, Y_i) - (A_i, Z_i) - (A^{n \setminus i}, Z^{n \setminus i}) (which a fortiori implies Y_i - (A_i, Z_i) - (A^{n \setminus i}, Z^{n \setminus i})).

Combining (52) and (53) now yields

nR \ge \sum_{i=1}^n [I(Z_i; A_i) + I(Z_i; U_i | Y_i, A_i)] (54)

similarly as in Step (a) in (19), upon defining U_i = (A^{n \setminus i}, Y^{n \setminus i}, Z^{i-1}, T). The proof of the converse is concluded by verifying that:

• \hat{X}_i = \hat{X}_i(T, Y^n) is a function of the pair (U_i, Y_i)

• the Markov relation X_i - Z_i - (A_i, U_i) holds (which follows from X_i - Z_i - (Z^{n \setminus i}, Y^{n \setminus i}))

• the Markov relation U_i - (X_i, Z_i, A_i) - Y_i holds (which follows from (Z^{n \setminus i}, Y^{n \setminus i}) - (X_i, Z_i, A_i) - Y_i)

and invoking the convexity of the informational rate distortion function defined on the right hand side of (49), which is established similarly as in Lemma 1. ■

III. Side Information Vending Machine at the Encoder 

In this section we consider the setting where the action sequence A^n is chosen at the encoder and the side information is available at the decoder and possibly at the encoder too. The setting is depicted in Figure 2. Specifically, a communication scheme in this setting for blocklength n and rate R is characterized by an action strategy

f : \mathcal{X}^n \to \mathcal{A}^n, (55)

an encoding function

T : \mathcal{X}^n \times \mathcal{Y}^n \to \{1, 2, ..., 2^{nR}\} (when side information is available at the encoder),
T : \mathcal{X}^n \to \{1, 2, ..., 2^{nR}\} (when side information is not available at the encoder),

and a decoding function

g : \{1, 2, ..., 2^{nR}\} \times \mathcal{Y}^n \to \hat{\mathcal{X}}^n. (56)

As in the case where the actions were chosen by the decoder, the side information Y^n will be the output of the memoryless channel P_{Y|X,A} whose input is (X^n, A^n). Furthermore, a triple (R, D, C) is said to be achievable if for all \epsilon > 0 and sufficiently large n there exists a scheme as above for blocklength n and rate R + \epsilon satisfying both

E[\sum_{i=1}^n \rho(X_i, \hat{X}_i)] \le n(D + \epsilon) (57)

and

E[\sum_{i=1}^n \Lambda(A_i)] \le n(C + \epsilon). (58)

The rate distortion (and cost) function R_e(D, C) (the letter e stands for encoder) is defined as

R_e(D, C) = \inf\{R' : the triple (R', D, C) is achievable\}. (59)

The general case remains open; however, we present here a characterization of three important cases: the lossless case (where Pr(\hat{X}^n = X^n) \to 1), the Gaussian case (where Y = A + X + N and X and N are independent Gaussian random variables), and the case where the Markov form Y - A - X holds. In all three cases, R_e(D, C) is independent of whether or not the S.I. is available at the encoder.

A. Lossless case 

Here we consider the lossless case, namely, for any \epsilon > 0 there exists an n such that Pr(\hat{X}^n = X^n) \ge 1 - \epsilon. Define

R_e^{(I)}(C) = \min [H(X|A,Y) + I(X;A) - I(Y;A)], (60)

where P_X and P_{Y|A,X} are determined by the problem setting and the minimization is over P_{A|X} such that E[\Lambda(A)] \le C. The term [H(X|A,Y) + I(X;A) - I(Y;A)] is convex in P_{A|X}, since the term -I(Y;A,X) is convex in P_{A|X} and the following identity holds:

H(X|A,Y) + I(X;A) - I(Y;A)
  = H(X|A,Y) + H(X) - H(X|A) - H(Y) + H(Y|A)
  = H(X) - I(X;Y|A) - H(Y) + H(Y|A)
  = H(X) - H(Y|A) + H(Y|A,X) - H(Y) + H(Y|A)
  = H(X) - I(Y;A,X). (61)
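Identity (61) holds for any joint distribution of (X, A, Y), which is easy to confirm numerically. The following is a minimal sketch, not part of the original derivation: the helper names (`entropy`, `marginal`, `random_joint`) and the alphabet sizes are assumptions chosen only for illustration.

```python
import math
import random

def entropy(pmf):
    """Shannon entropy (bits) of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def H(joint, idx):
    """Joint entropy of the coordinates listed in idx."""
    return entropy(marginal(joint, idx))

def random_joint(seed, nx=2, na=3, ny=2):
    """Hypothetical helper: a random joint pmf on (x, a, y) triples."""
    rng = random.Random(seed)
    w = {(x, a, y): rng.random()
         for x in range(nx) for a in range(na) for y in range(ny)}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

X, A, Y = 0, 1, 2  # coordinate positions inside each (x, a, y) tuple

def left_side(j):
    """H(X|A,Y) + I(X;A) - I(Y;A)."""
    h_x_given_ay = H(j, [X, A, Y]) - H(j, [A, Y])
    i_xa = H(j, [X]) + H(j, [A]) - H(j, [X, A])
    i_ya = H(j, [Y]) + H(j, [A]) - H(j, [A, Y])
    return h_x_given_ay + i_xa - i_ya

def right_side(j):
    """H(X) - I(Y;A,X)."""
    i_y_ax = H(j, [Y]) + H(j, [X, A]) - H(j, [X, A, Y])
    return H(j, [X]) - i_y_ax

gaps = []
for seed in range(5):
    j = random_joint(seed)
    gaps.append(abs(left_side(j) - right_side(j)))
print(max(gaps))  # numerically zero up to floating-point error
```

The check only exercises the identity itself; it says nothing about the minimization in (60).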



Let us denote by R_e(C) the minimum (operational) rate that is needed to reconstruct the source losslessly at the decoder with an action cost not exceeding C.

Theorem 5: For the setting in Figure 2, where the actions are chosen by the encoder and the side information Y is known to the decoder and may or may not be known to the encoder, the minimum rate that is needed to reconstruct the source under a cost constraint C is given by

R_e(C) = R_e^{(I)}(C). (62)

Proof:

Achievability: The achievability proof is divided into two cases according to the sign of the term I(X;A) - I(Y;A). In the first case we assume I(X;A) - I(Y;A) \ge 0 and we use a coding scheme that is based on Wyner-Ziv coding [15] for rate distortion theory where side information is known at the decoder. In the second case, we assume I(Y;A) - I(X;A) > 0 and we use a coding scheme that is based on Gel'fand-Pinsker coding [6] for a channel with states where the state is known to the encoder.

First case, I(X;A) - I(Y;A) \ge 0: We first generate a codebook of action sequences A^n that covers X^n; hence, the size of the codebook needs to be 2^{n(I(A;X) + \epsilon)}, where \epsilon > 0. Then, similarly to the Wyner-Ziv coding scheme [15], we bin the codebook into 2^{n(I(A;X) - I(Y;A) + 2\epsilon)} bins such that each bin contains 2^{n(I(Y;A) - \epsilon)} codewords. Similarly to the Wyner-Ziv scheme, we look in the codebook for a sequence A^n that is jointly typical with X^n and transmit the number of the bin that contains the sequence. The decoder receives the bin number and looks for the sequences A^n in that bin that are jointly typical with the side information Y^n. Similarly to the analysis in the Wyner-Ziv setting, with high probability there will be only one codeword that is jointly typical with Y^n. (The Markov form that is needed in the analysis of the Wyner-Ziv setting is not needed here, since the side information Y^n is generated according to P_{Y|A,X}, and therefore if (A^n, X^n) are jointly typical then with high probability the triple (A^n, X^n, Y^n) would also be jointly typical.) In the final step the encoder uses a Slepian-Wolf scheme for transmitting X^n losslessly to the decoder, which has side information (Y^n, A^n); hence an additional rate of H(X|Y,A) is needed.

Second case, I(Y;A) - I(X;A) > 0: First we notice that the expression in (62) can be written as H(X|A,Y) - (I(Y;A) - I(X;A)). The actions can be considered as the input to a channel with states, where the output of the channel is Y, the state is X, and the conditional probability of the channel is P_{Y|X,A}. A rate of I(Y;A) - I(X;A) is achievable over this channel using the Gel'fand-Pinsker coding scheme [6]. In addition, the Gel'fand-Pinsker coding scheme induces a triple (X^n, Y^n, A^n) that is jointly typical. Hence, we can use the message in order to reduce the needed rate H(X|Y,A), as in the Slepian-Wolf scheme, to H(X|A,Y) - (I(Y;A) - I(X;A)).

Converse: for the converse part, fix a scheme of rate R for a block of length n with a probability of error Pr(\hat{X}^n \ne X^n) = P_e^{(n)} and consider:

nR \ge H(T)
   \ge H(T | Y^n)
   = H(X^n, T | Y^n) - H(X^n | T, Y^n)
   \stackrel{(a)}{\ge} H(X^n, T | Y^n) - n\epsilon_n
   \stackrel{(b)}{=} H(X^n, A^n | Y^n) - n\epsilon_n
   = H(X^n, A^n, Y^n) - H(Y^n) - n\epsilon_n
   = H(X^n) + H(A^n | X^n) + H(Y^n | A^n, X^n) - H(Y^n) - n\epsilon_n
   \stackrel{(c)}{\ge} \sum_{i=1}^n [H(X_i) + H(Y_i | A_i, X_i) - H(Y_i)] - n\epsilon_n
   \ge \min [n(H(X) + H(Y|A,X) - H(Y))] - n\epsilon_n, (63)

where \epsilon_n = P_e^{(n)} \log|\mathcal{X}| + 1/n and step (a) follows from Fano's inequality. Step (b) follows from the fact that A^n and T are deterministic functions of X^n. Step (c) follows from the following four relations: P(x^n) = \prod_{i=1}^n P(x_i), P(y^n | a^n, x^n) = \prod_{i=1}^n P(y_i | a_i, x_i), H(A^n | X^n) = 0, and H(Y^n) \le \sum_{i=1}^n H(Y_i). The minimization in the last step is over all conditional distributions P_{A|X} that satisfy the cost constraint, namely E[\Lambda(A)] \le C, and the inequality follows from the fact that the expression H(X) + H(Y|A,X) - H(Y) is convex in P_{A|X} for fixed P_X and P_{Y|A,X}. The converse proof is completed by invoking the fact that, since R is an achievable rate, there exists a sequence of codes at rate R such that \epsilon_n \to 0. ■

We have seen that, in the absence of a cost constraint on the actions, the minimum rate needed for a near lossless reconstruction at the decoder is given by

\min [I(X;A) + H(X|Y,A) - I(Y;A)] (64)

(regardless of whether or not the side information is present at the encoder). Thus, I(Y;A) represents the saving in rate relative to the case where the actions are taken by the decoder (recall (28) for the minimum rate in that case). To see that this can be significant, recall the example \mathcal{X} = \mathcal{A} = \mathcal{Y} = \{0, 1\}, where X is a fair coin flip, P_{Y|X,A=0} is the Z-channel with crossover probability \delta from 1 to 0, and P_{Y|X,A=1} is the S-channel with crossover probability \delta from 0 to 1. It is easily seen that in this case, taking A = X, I(X;A) + H(X|Y,A) - I(Y;A) = 0 and so, a fortiori, the minimum in (64) is zero. That the source can be reconstructed losslessly with zero rate in this case is equally easy to see from an operational standpoint, since taking actions A_i = X_i ensures that Y_i = X_i with probability one.
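The Z-channel/S-channel example can be checked numerically. The sketch below evaluates I(X;A) + H(X|Y,A) - I(Y;A) for two deterministic action rules; the helper names and the choice \delta = 0.1 are assumptions made only for illustration, and the fixed-action rule A = 0 is included as a point of comparison rather than taken from the text.

```python
import math

def entropy(pmf):
    """Shannon entropy (bits) of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def H(joint, idx):
    return entropy(marginal(joint, idx))

def vending_joint(delta, action_of_x):
    """Joint pmf of (X, A, Y) for a fair-coin X and the rule A = action_of_x(X)."""
    joint = {}
    for x in (0, 1):
        a = action_of_x(x)
        if a == 0:
            # Z-channel: input 1 is received as 0 with probability delta
            p_y1 = (1.0 - delta) if x == 1 else 0.0
        else:
            # S-channel: input 0 is received as 1 with probability delta
            p_y1 = 1.0 if x == 1 else delta
        for y, p in ((0, 1.0 - p_y1), (1, p_y1)):
            if p > 0:
                joint[(x, a, y)] = joint.get((x, a, y), 0.0) + 0.5 * p
    return joint

def lossless_expression(joint):
    """I(X;A) + H(X|Y,A) - I(Y;A) for a joint pmf on (x, a, y)."""
    X, A, Y = 0, 1, 2
    i_xa = H(joint, [X]) + H(joint, [A]) - H(joint, [X, A])
    h_x_given_ya = H(joint, [X, A, Y]) - H(joint, [A, Y])
    i_ya = H(joint, [Y]) + H(joint, [A]) - H(joint, [A, Y])
    return i_xa + h_x_given_ya - i_ya

rate_matched = lossless_expression(vending_joint(0.1, lambda x: x))  # A = X
rate_fixed = lossless_expression(vending_joint(0.1, lambda x: 0))    # A = 0 always
print(rate_matched, rate_fixed)
```

With A = X the expression vanishes, matching the zero-rate claim, while the fixed action A = 0 leaves a strictly positive value equal to H(X|Y) for the Z-channel.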



B. Gaussian Case 

Here we consider the case where

• the source has a Gaussian distribution with zero mean and variance \sigma_X^2, i.e.,

X ~ N(0, \sigma_X^2), (65)

• the relation between Y, X, A is given by

Y = X + A + N, (66)

where N is a random variable independent of (A, X) that has a Gaussian distribution with zero mean and variance \sigma_N^2, i.e.,

N ~ N(0, \sigma_N^2), (67)

• the distortion is the mean square error, i.e., E[\frac{1}{n} \sum_{i=1}^n (X_i - \hat{X}_i)^2], and it has to be no larger than D,

• the cost of the actions is E[\frac{1}{n} \sum_{i=1}^n A_i^2], and it has to be no larger than C. Without loss of generality, we assume that C = \alpha^2 \sigma_X^2, where \alpha \ge 0.



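Theorem 6: For the Gaussian setting of Figure 2, as described above,

R_e(D,C) = \begin{cases} \frac{1}{2} \log \frac{\sigma_X^2 \sigma_N^2}{[(1 + \sqrt{C}/\sigma_X)^2 \sigma_X^2 + \sigma_N^2] D} & \text{if } [(1 + \sqrt{C}/\sigma_X)^2 \sigma_X^2 + \sigma_N^2] \cdot D \le \sigma_X^2 \sigma_N^2, \\ 0 & \text{otherwise.} \end{cases} (68)

[Figure 10: plot of R_e(D, C) against D, for D between 0.05 and 0.4.]

Fig. 10. R_e(D, C) in the Gaussian case, for \sigma_X^2 = \sigma_N^2 = 1. The boundary of the region where R_e(D, C) = 0 is the curve D = 1/((1 + \sqrt{C})^2 + 1). Indeed, this distortion level can be achieved with zero rate by estimating X on the basis of Y = X + N/(1 + \sqrt{C}).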
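For readers who want to reproduce curves like those in Fig. 10, the following sketch evaluates the Gaussian rate numerically. It assumes the closed form (1/2) log(\sigma_X^2 \sigma_N^2 / (((1 + \alpha)^2 \sigma_X^2 + \sigma_N^2) D)) with \alpha = \sqrt{C}/\sigma_X, clipped at zero, as given by the achievability and converse expressions of this subsection. The function names and the use of base-2 logarithms (bits) are assumptions, since the text does not fix the logarithm base.

```python
import math

def zero_rate_distortion(C, var_x=1.0, var_n=1.0):
    """Smallest distortion reachable with zero rate: the MMSE of X given Y
    when the action is A = alpha * X with alpha = sqrt(C / var_x)."""
    alpha = math.sqrt(C / var_x)
    return var_x * var_n / ((1.0 + alpha) ** 2 * var_x + var_n)

def gaussian_rate(D, C, var_x=1.0, var_n=1.0):
    """Rate in bits per symbol (log base 2 is an assumption) for distortion D
    and action cost C in the Gaussian setting."""
    d0 = zero_rate_distortion(C, var_x, var_n)
    if D >= d0:
        return 0.0  # the side information alone already meets the target
    return 0.5 * math.log2(d0 / D)

# Unit variances, as in Fig. 10.
for cost in (0.0, 0.5, 1.0):
    print(cost, zero_rate_distortion(cost), gaussian_rate(0.1, cost))
```

A larger cost budget C lowers the zero-rate distortion threshold and therefore the rate needed at any fixed D below it.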

Before proving Theorem 6, we would like to point out that the state amplification problem [16], [17] is closely related to the vending side information problem described here. In the state amplification problem, the goal is to design a communication scheme for a channel with an i.i.d. state sequence, S^n, which is known to the encoder. The purpose of the scheme is to send a message through the channel, and at the same time to describe to the decoder the state sequence S^n. The case where there is no message to send, namely, where the input to the channel is used only to describe the state sequence, is equivalent to the problem presented here when R_e(D, C) = 0, namely, when we are using only the actions to describe the source and no additional message is sent. If R_e(D, C) = 0, we obtain from (68) that for the Gaussian source coding problem, the minimum mean square error satisfies

D \ge \frac{\sigma_X^2 \sigma_N^2}{(\sigma_X + \sqrt{C})^2 + \sigma_N^2}, (69)

a result that was also obtained in [17, Theorem 2], where the channel is the Gaussian channel and the goal is to describe the state sequence with minimum mean square error distortion.
Proof of Theorem 6:

Achievability: The encoder chooses the actions to be A = \alpha X and then uses a code for the Gaussian Wyner-Ziv problem with side information at the decoder [13]. The side information satisfies Y = X + A + N = (1 + \alpha)X + N, which is equivalent to having the side information X + N/(1 + \alpha). Denote N' = N/(1 + \alpha), so that \sigma_{N'}^2 = \sigma_N^2/(1 + \alpha)^2. Using the Gaussian Wyner-Ziv result, a rate

R = \frac{1}{2} \log \frac{\sigma_X^2 \sigma_{N'}^2}{(\sigma_X^2 + \sigma_{N'}^2) D} = \frac{1}{2} \log \frac{\sigma_X^2 \sigma_N^2}{((1 + \alpha)^2 \sigma_X^2 + \sigma_N^2) D} (70)

is achievable.

Converse: We prove the converse in two steps. First we derive the lower bound

R_e(D,C) \ge \min_{P_{A|X}, P_{\hat{X}|X,A,Y}} I(X;\hat{X}) - I(Y;X,A), (71)

which holds for any P_X and P_{Y|A,X} (not necessarily Gaussian), and then we evaluate it for the Gaussian case. Fix a scheme at rate R for a block of length n and consider

nR \ge H(T)
   \ge H(T | Y^n)
   \ge I(X^n; T | Y^n)
   \stackrel{(a)}{=} I(X^n; Y^n, T) - I(X^n; Y^n)
   \stackrel{(b)}{\ge} I(X^n; \hat{X}^n) - I(X^n, A^n; Y^n)
   \stackrel{(c)}{\ge} \sum_{i=1}^n [I(X_i; \hat{X}_i) - I(X_i, A_i; Y_i)], (72)

where (a) follows from [14, Lemma 3.2], which asserts that for arbitrary random variables I(X, Z; Y) = I(Z; Y) + I(X; Y|Z). Step (b) follows from the facts that \hat{X}^n is a deterministic function of the pair (T, Y^n), and A^n is a deterministic function of X^n. Step (c) follows from the facts that H(X^n) = \sum_{i=1}^n H(X_i), H(Y^n | A^n, X^n) = \sum_{i=1}^n H(Y_i | A_i, X_i), and conditioning reduces entropy. Since the expression in (72) is convex in P_{A|X}, P_{\hat{X}|X} for fixed P_X and P_{Y|A,X}, we obtain the lower bound

R_e(D,C) \ge \min I(X;\hat{X}) - I(Y;X,A), (73)

where the minimization is over conditional distributions P_{A|X}, P_{\hat{X}|X} that satisfy the distortion and cost constraints.
Now we evaluate the lower bound for the Gaussian case.

I(X;\hat{X}) - I(Y;X,A) = H(X) - H(X|\hat{X}) - H(Y) + H(Y|A,X)
  = H(X) - H(X|\hat{X}) - H(Y) + H(N)
  \stackrel{(a)}{\ge} \frac{1}{2} \log 2\pi e \sigma_X^2 - \frac{1}{2} \log 2\pi e D - \frac{1}{2} \log 2\pi e ((1 + \alpha)^2 \sigma_X^2 + \sigma_N^2) + \frac{1}{2} \log 2\pi e \sigma_N^2
  = \frac{1}{2} \log \frac{\sigma_X^2 \sigma_N^2}{((1 + \alpha)^2 \sigma_X^2 + \sigma_N^2) D}, (74)

where inequality (a) follows from the fact that H(X|\hat{X}) \le H(X - \hat{X}) \le \frac{1}{2} \log 2\pi e D (because of the constraint that E(X - \hat{X})^2 \le D) and H(Y) \le \frac{1}{2} \log 2\pi e \sigma_Y^2, where

\sigma_Y^2 = E[X^2] + 2E[AX] + E[A^2] + E[N^2]
  \le E[X^2] + 2\sqrt{E[X^2] E[A^2]} + E[A^2] + E[N^2]
  \le \sigma_X^2 + 2\alpha\sigma_X^2 + \alpha^2 \sigma_X^2 + \sigma_N^2
  = \sigma_X^2 (1 + \alpha)^2 + \sigma_N^2. (75)



C. Markov Form Y - A - X

Here we consider the case where the Markov form X - A - Y holds.

Theorem 7: The rate distortion (and cost) function R_e(D, C) for the setting in Figure 2, when P_{Y|A,X}(y|a,x) = P_{Y|A}(y|a), satisfies

R_e(D,C) = \min [I(X;\hat{X}) - I(A;Y)]^+, (76)

where [\cdot]^+ denotes \max\{\cdot, 0\} and the minimization is over joint distributions of the form P_{A,X,Y,\hat{X}}(a,x,y,\hat{x}) = P_X(x) P_{A|X}(a|x) P_{Y|A}(y|a) P_{\hat{X}|X}(\hat{x}|x) satisfying E\rho(X,\hat{X}) \le D and E\Lambda(A) \le C.

It is interesting to note that the solution is the difference between a rate-distortion expression \min_{P_{\hat{X}|X}} I(X;\hat{X}) and a channel capacity expression \max_{P_A} I(A;Y). I.e.,

R_e(D,C) = [R(P_X, D) - Cap(P_{Y|A}, C)]^+, (77)

where Cap(P_{Y|A}, C) denotes the capacity of the channel P_{Y|A} under a cost constraint C.

Proof of Theorem 7:

Achievability: Design a regular rate distortion code, which needs a rate larger than I(X;\hat{X}), and then transmit part of the rate through the channel that has input A and output Y. Therefore the total rate that needs to be transmitted through the index T(X^n) is the difference I(X;\hat{X}) - I(A;Y).

Converse: We invoke the lower bound given in (73) and obtain

R_e(D,C) \ge I(X;\hat{X}) - I(Y;X,A)
  = I(X;\hat{X}) - I(Y;A), (78)

where the last equality is due to the Markov form X - A - Y. ■
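As a concrete instance of the difference form [R(P_X, D) - Cap(P_{Y|A}, C)]^+, the sketch below assumes a specific setup that is not taken from the text: X is a Bernoulli(1/2) source under Hamming distortion, so R(P_X, D) = 1 - h(D) for D < 1/2; the action channel P_{Y|A} is a binary symmetric channel with crossover probability p, so its capacity is 1 - h(p); and actions are free, so the cost constraint is inactive. All function names are hypothetical.

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def rate_distortion_fair_coin(D):
    """R(P_X, D) for a Bernoulli(1/2) source under Hamming distortion."""
    return 1.0 - h2(D) if D < 0.5 else 0.0

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - h2(p)

def markov_rate(D, p):
    """[R(P_X, D) - Cap(P_{Y|A})]^+ for the assumed binary setup."""
    return max(rate_distortion_fair_coin(D) - bsc_capacity(p), 0.0)

for D in (0.05, 0.1, 0.2):
    print(D, markov_rate(D, p=0.1))
```

In this assumed setup the rate reduces to [h(p) - h(D)]^+ for D < 1/2, so no index bits are needed once the target distortion D is at least the channel crossover probability p.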






D. Upper and Lower Bounds for the General Case 

1) Achievable Rates:

• Absence of S.I. at Encoder: For the setting of Figure 2 with an open switch, i.e., when the encoder has no access to the S.I., the following is an achievable rate:

I(U;X|A,Y) + I(X;A) - I(Y;A) (79)

under any joint distribution of the form

P_X(x) P_{A|X}(a|x) P_{Y|X,A}(y|x,a) P_{U|X,A}(u|x,a)

such that E\rho(X, \hat{X}^{opt}(A,Y,U)) \le D and E\Lambda(A) \le C. The argument for why this rate is achievable is similar to that given in Subsection III-A for why the right side of (62) is achievable, the difference being that the H(X|A,Y) term in (62), corresponding to Slepian-Wolf coding of X^n conditioned on A^n, is replaced by I(U;X|A,Y), corresponding to Wyner-Ziv coding conditioned on A^n.

• S.I. Available at Encoder: For the setting of Figure 2 with a closed switch, i.e., when the encoder has access to the S.I., the following is an achievable rate:

I(X;\hat{X}|A,Y) + I(X;A) - I(Y;A) (80)

under any joint distribution of the form

P_X(x) P_{A|X}(a|x) P_{Y|X,A}(y|x,a) P_{\hat{X}|X,A,Y}(\hat{x}|x,a,y)

such that E\rho(X,\hat{X}) \le D and E\Lambda(A) \le C. The argument for why this rate is achievable is similar to that for why the right side of (79) is achievable, the difference being that the I(U;X|A,Y) term, corresponding to Wyner-Ziv coding conditioned on A^n, is replaced by I(X;\hat{X}|A,Y), corresponding to standard rate distortion coding conditioned on A^n, Y^n.

2) Lower Bound on Achievable Rate: As pointed out in Subsection III-B, the proof of the converse part of Theorem 6 is valid for the general case (i.e., beyond the Gaussian scenario), and shows that the rate needed to achieve distortion D at cost C, regardless of whether or not S.I. is available at the encoder, is at least as large as

I(X;\hat{X}) - I(Y;X,A) (81)

for some joint distribution of the form

P_X(x) P_{A|X}(a|x) P_{Y|X,A}(y|x,a) P_{\hat{X}|X,A,Y}(\hat{x}|x,a,y)

satisfying the distortion and cost constraints. It is worthwhile to note that this rate was shown to be achievable for the three special cases considered in the previous three subsections. Indeed, this fact was shown explicitly for the cases of Subsection III-B and Subsection III-C, and in the lossless case (81) becomes H(X) - I(Y;X,A) = H(X|A,Y) + I(A;X) - I(A;Y), which coincides with the expression on the right hand side of (62).

To see that the lower bound in (81) may not be tight in general, even when the S.I. is available at the encoder, consider the standard case of rate distortion coding with S.I. available to both encoder and decoder. In this case A is degenerate, so the right hand side of (81) reduces to

I(X;\hat{X}) - I(Y;X,A) = H(X|Y) - H(X|\hat{X}), (82)

while the tight lower bound on the achievable rate for this scenario is well-known to be given by

I(X;\hat{X}|Y) = H(X|Y) - H(X|\hat{X},Y), (83)

which may be strictly larger than the expression in (82).

IV. Summary and Open Questions 

We have studied source coding in the presence of side information, when the system can take actions that affect the 
availability, quality, or nature of the side information. We have given a full characterization of the rate-distortion-cost 
tradeoff when the actions are taken by the decoder. For the case where the actions are taken by the encoder, we have 
characterized this tradeoff in a few important special cases, while providing upper and lower bounds on the achievable rate 
for the general case. 

The most significant question left open by our work is a full characterization of the rate-distortion-cost tradeoff for the setting of actions taken at the encoder (beyond the special cases considered here), with S.I. that may or may not be available at the encoder (Figure 2). Another question left open, for the setting of actions taken by the decoder, is whether the rate-distortion-cost tradeoff can be improved when each action is allowed to depend on the side information symbols generated thus far, that is, when the ith action is allowed to be of the form A_i = A_i(T, Y^{i-1}) (rather than A_i(T)).